Term Deposit Predictor¶
by Jackson Lu, Daniel Yorke, Charlene Chin, and Mohammed Ibrahim, 2025/11/21
Summary¶
This project focuses on predicting whether clients will subscribe to a term deposit using the Bank Marketing dataset. A logistic regression model was developed, incorporating all available predictor variables after appropriate preprocessing. The model was evaluated using stratified cross-validation, with an emphasis on the F1 score, which balances precision and recall. The analysis was conducted in Python using key libraries such as NumPy, pandas, and scikit-learn, with all code documented for reproducibility. Our final classifier performed fairly well on an unseen test set, achieving an accuracy of 0.844, an F1-score of 0.551, and a ROC-AUC of 0.91. This indicates that the model is reasonably effective at identifying clients who will subscribe to a term deposit, although there is room for improvement, particularly in recall. Further refinements could involve exploring additional features, tuning hyperparameters, or experimenting with alternative modeling techniques to enhance predictive performance.
Introduction¶
Financial institutions rely heavily on effective marketing strategies to identify which clients are most likely to subscribe to long-term financial products such as term deposits. These products support both customer financial planning and bank stability, yet subscription rates are often low due to ineffective targeting. Traditional marketing approaches depend heavily on human judgment, intuition, and repeated client contact, which can be costly, time-consuming, and inconsistent in effectiveness. As a result, developing more objective and data-driven methods for understanding and predicting client behaviour has become increasingly important.
In this project, we ask whether a machine learning algorithm can accurately predict whether a bank client will subscribe to a term deposit based on demographic attributes, financial information, and past marketing interactions. This question is important because traditional marketing strategies tend to rely on broad outreach rather than individualized prediction, leading to inefficiencies and potential client fatigue. Furthermore, understanding which client characteristics are associated with subscription behavior may support more personalized communication strategies and improve customer experience. If a machine learning classifier such as logistic regression can reliably predict subscription outcomes, it may enable more data-driven, scalable, and cost-effective marketing decisions, ultimately improving the performance of future campaigns.
Methods¶
Data¶
The dataset used in this project is the Bank Marketing dataset, created by Sérgio Moro, Paulo Cortez, and Paulo Rita in 2014 at the University of Minho in Portugal as part of a series of direct marketing campaigns conducted by a Portuguese banking institution. The data contains information on client demographics, financial status, and details related to previous marketing contacts, and is publicly available through the UCI Machine Learning Repository.
The dataset contains 45,211 observations and 17 columns in total, comprising 16 predictor variables and 1 binary target variable (y) indicating whether the client subscribed to a term deposit. Each record represents a client who was contacted during a marketing campaign. The predictor variables capture a mix of demographic, financial, and campaign-related information. Among these, several features contain missing values (e.g., job, education, contact, and poutcome), requiring appropriate imputation or handling during preprocessing. Missing categorical values were imputed with a constant placeholder (“unknown”), and numerical features were standardized using StandardScaler to ensure comparability across variables. The target variable y is binary (yes or no), with only around 11–12% of the clients subscribing to a term deposit, resulting in a class imbalance that must be considered in model evaluation. Together, these attributes provide a rich and diverse feature set for assessing whether logistic regression can effectively capture the patterns associated with successful term-deposit subscriptions.
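The imputation and scaling steps described above can be sketched with scikit-learn. This is a minimal illustration on a small synthetic frame (not the real data): categorical gaps are filled with the constant "unknown" before one-hot encoding, while numeric columns are standardized.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small synthetic frame standing in for the real features
toy = pd.DataFrame({
    "job": ["admin.", np.nan, "technician"],  # one missing category
    "balance": [2143, 29, 2],
})

# Categorical: fill missing values with the constant "unknown", then one-hot encode
categorical = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="unknown"),
    OneHotEncoder(handle_unknown="ignore"),
)

# Numerical: standardize to zero mean and unit variance
preprocessor = ColumnTransformer([
    ("cat", categorical, ["job"]),
    ("num", StandardScaler(), ["balance"]),
])

# Three job categories after imputation ("admin.", "unknown", "technician")
# plus one scaled numeric column -> 4 output columns
Xt = preprocessor.fit_transform(toy)
```

Wrapping these steps in a `ColumnTransformer` ensures the identical transformations learned on the training set are later applied to the test set.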
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Analysis¶
A logistic regression classifier was developed to model the probability that a client would subscribe to a term deposit (y). All predictor variables from the original dataset were included after appropriate preprocessing, which involved encoding categorical features with OneHotEncoder and scaling numerical features using StandardScaler. The dataset was randomly divided into a training set (80%) and a test set (20%) to enable unbiased performance evaluation.
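The modeling setup above can be sketched end to end. The data here is a hypothetical miniature dataset (one categorical and one numeric column, with a toy target), not the Bank Marketing data, but the pipeline structure mirrors what the analysis uses: one-hot encoding, scaling, an 80/20 split, and a logistic regression fit.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature dataset standing in for the real features
rng = np.random.default_rng(0)
n = 200
df_toy = pd.DataFrame({
    "job": rng.choice(["admin.", "technician", "services"], size=n),
    "duration": rng.integers(0, 1000, size=n),
})
y_toy = (df_toy["duration"] > 500).astype(int)  # toy binary target

# Encode categoricals, scale numerics, then fit logistic regression
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["job"]),
    ("num", StandardScaler(), ["duration"]),
])
model = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))

# 80/20 train/test split, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
    df_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)
model.fit(X_train, y_train)
test_acc = model.score(X_test, y_test)
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are fit only on the training split, avoiding information leakage into the held-out test set.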
Prior exploratory analysis examined the distributions of all input variables in the training set, with plots colored by the binary outcome (“yes” or “no”). Most numerical predictors—such as previous, pdays, campaign, duration, age, and balance—displayed substantial overlap between the two classes. However, some features, particularly duration, showed clear differences: clients who subscribed tended to have significantly longer call durations. This observation is consistent with findings from the original dataset documentation, confirming duration as a strong predictor of subscription. Other variables, such as campaign, previous, and pdays, were highly right-skewed with long tails, while categorical variables (e.g., job, marital status, education, and contact type) appeared to carry complementary contextual information about clients. These exploratory patterns were visualized in Figure 1, which displays feature distributions by subscription status. Figure 2 presents the correlation matrix among numerical predictors.
Correlation matrices (both Pearson and Spearman) were also examined to assess relationships among predictors. Overall, correlations between numerical features were weak, indicating low multicollinearity, which supports the use of logistic regression as an interpretable linear model. Some moderate associations were found among pdays, previous, and campaign, reflecting their shared connection to marketing contact history.
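Both correlation matrices can be computed directly with pandas. The toy frame below is illustrative only; `pdays`, `previous`, and `campaign` are the column names from the real dataset, but the values are made up.

```python
import pandas as pd

# Toy numeric frame standing in for the campaign-history columns
df_num = pd.DataFrame({
    "pdays": [-1, -1, 90, 180, 30],
    "previous": [0, 0, 2, 4, 1],
    "campaign": [1, 3, 2, 1, 2],
})

# Pearson measures linear association; Spearman correlates ranks,
# so it is more robust to the heavy right skew seen in these variables
pearson = df_num.corr(method="pearson")
spearman = df_num.corr(method="spearman")
```

Comparing the two matrices helps flag monotone but non-linear relationships: a pair with a much larger Spearman than Pearson coefficient is related, just not linearly.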
Model evaluation was conducted using stratified 5-fold cross-validation to address class imbalance. Performance was primarily assessed using the F1-score, which balances precision and recall, along with accuracy and ROC-AUC for comprehensive evaluation. Across the five folds, the model achieved a mean accuracy of 0.844, a mean F1-score of 0.551, and a mean ROC-AUC of 0.910. Training and test results were closely aligned, indicating minimal overfitting. These results suggest that the logistic regression model provides strong discriminatory ability, though recall could be improved by further class rebalancing or feature engineering.
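The evaluation scheme above can be sketched with `StratifiedKFold` and `cross_validate`. The data here is synthetic (generated with roughly the same ~12% positive rate as the real target), so the scores below are illustrative, not the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced problem (~12% positives) standing in for the real data
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, weights=[0.88], random_state=0
)

# Stratified folds preserve the class ratio within every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    LogisticRegression(max_iter=1000), X_syn, y_syn,
    cv=cv, scoring=["accuracy", "f1", "roc_auc"],
)
mean_f1 = scores["test_f1"].mean()
mean_auc = scores["test_roc_auc"].mean()
```

Scoring on multiple metrics in one `cross_validate` call keeps the fold assignments identical across metrics, so accuracy, F1, and ROC-AUC are directly comparable.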
All analysis was conducted in Python (Van Rossum & Drake, 2009) using NumPy (Harris et al., 2020), pandas (McKinney, 2010), scikit-learn (Pedregosa et al., 2011), and Altair for visualization. All code for data processing, modeling, and figure generation is documented within this notebook for reproducibility.
Results and Discussion¶
The results demonstrate that logistic regression can effectively distinguish clients likely to subscribe to a term deposit, achieving strong performance across multiple evaluation metrics. The identification of duration as the most influential predictor aligns with expectations—longer calls typically indicate higher engagement and interest in the product. The moderate F1-score, however, reflects difficulty in recalling all positive cases, which was anticipated due to the dataset’s pronounced class imbalance (only around 11–12% subscribed).
These findings highlight the model’s practical potential: banks could apply such a model to prioritize high-probability clients, improving campaign efficiency while reducing unnecessary contact costs. The high ROC-AUC value (0.91) suggests that even a simple, interpretable model can meaningfully support decision-making in marketing strategy.
Future work could explore whether non-linear models (e.g., tree-based or ensemble methods) further improve recall, or whether feature engineering on time-related or interaction variables enhances predictive performance. In addition, investigating the relative influence of demographic versus campaign-related features could deepen understanding of what drives client subscription behavior.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.model_selection import StratifiedKFold
import pandas as pd
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
import altair as alt
import altair_ally as ally
from altair import datum
# Enable Altair to render in Jupyter
alt.data_transformers.enable('json', prefix='data/altair/')
from ucimlrepo import fetch_ucirepo
# fetch dataset
bank_marketing = fetch_ucirepo(id=222)
# data (as pandas dataframes)
X = bank_marketing.data.features
y = bank_marketing.data.targets
# Define the folder path
folder_path = './data/'
altair_path = './data/altair/'
# Ensure the directory exists (create it if it doesn't)
os.makedirs(folder_path, exist_ok=True)
os.makedirs(altair_path, exist_ok=True)
# Define file paths
features_file_path = os.path.join(folder_path, 'bank_marketing_features.csv')
targets_file_path = os.path.join(folder_path, 'bank_marketing_targets.csv')
# Export the DataFrames to CSV
X.to_csv(features_file_path, index=False) # index=False prevents pandas from writing row indices to the file
y.to_csv(targets_file_path, index=False)
df = pd.concat([X, y], axis=1)
# to ignore warning messages from python ally
warnings.filterwarnings(
"ignore",
message="You passed a `<class 'narwhals.stable.v1.DataFrame'>` to `is_pandas_dataframe`.",
category=UserWarning,
module="altair.utils.data"
)
df.head()
| | age | job | marital | education | default | balance | housing | loan | contact | day_of_week | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | NaN | 5 | may | 261 | 1 | -1 | 0 | NaN | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | NaN | 5 | may | 151 | 1 | -1 | 0 | NaN | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | NaN | 5 | may | 76 | 1 | -1 | 0 | NaN | no |
| 3 | 47 | blue-collar | married | NaN | no | 1506 | yes | no | NaN | 5 | may | 92 | 1 | -1 | 0 | NaN | no |
| 4 | 33 | NaN | single | NaN | no | 1 | no | no | NaN | 5 | may | 198 | 1 | -1 | 0 | NaN | no |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   age          45211 non-null  int64
 1   job          44923 non-null  object
 2   marital      45211 non-null  object
 3   education    43354 non-null  object
 4   default      45211 non-null  object
 5   balance      45211 non-null  int64
 6   housing      45211 non-null  object
 7   loan         45211 non-null  object
 8   contact      32191 non-null  object
 9   day_of_week  45211 non-null  int64
 10  month        45211 non-null  object
 11  duration     45211 non-null  int64
 12  campaign     45211 non-null  int64
 13  pdays        45211 non-null  int64
 14  previous     45211 non-null  int64
 15  poutcome     8252 non-null   object
 16  y            45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB
df.describe()
| age | balance | day_of_week | duration | campaign | pdays | previous | |
|---|---|---|---|---|---|---|---|
| count | 45211.000000 | 45211.000000 | 45211.000000 | 45211.000000 | 45211.000000 | 45211.000000 | 45211.000000 |
| mean | 40.936210 | 1362.272058 | 15.806419 | 258.163080 | 2.763841 | 40.197828 | 0.580323 |
| std | 10.618762 | 3044.765829 | 8.322476 | 257.527812 | 3.098021 | 100.128746 | 2.303441 |
| min | 18.000000 | -8019.000000 | 1.000000 | 0.000000 | 1.000000 | -1.000000 | 0.000000 |
| 25% | 33.000000 | 72.000000 | 8.000000 | 103.000000 | 1.000000 | -1.000000 | 0.000000 |
| 50% | 39.000000 | 448.000000 | 16.000000 | 180.000000 | 2.000000 | -1.000000 | 0.000000 |
| 75% | 48.000000 | 1428.000000 | 21.000000 | 319.000000 | 3.000000 | -1.000000 | 0.000000 |
| max | 95.000000 | 102127.000000 | 31.000000 | 4918.000000 | 63.000000 | 871.000000 | 275.000000 |
Below we show the distributions of the individual features, colored by subscription status.
ally.alt.data_transformers.enable('vegafusion')
ally.dist(df, color='y')